── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
library(lubridate)
library(dplyr)
Step 1: Read in data and check it
# Read in the EPA data from 2002
twozero <- fread("2002.csv")
# Read in the EPA data from 2022
twotwo <- fread("2022.csv")
Step 1a. Check the dimensions
dim(twozero)
[1] 15976 20
The 2002 data (twozero) has 15,976 rows (observations) and 20 columns.
dim(twotwo)
[1] 56140 20
The 2022 data (twotwo) has 56,140 rows (observations) and 20 columns.
head(twozero)
Date Source Site ID POC Daily Mean PM2.5 Concentration UNITS
1: 01/05/2002 AQS 60010007 1 25.1 ug/m3 LC
2: 01/06/2002 AQS 60010007 1 31.6 ug/m3 LC
3: 01/08/2002 AQS 60010007 1 21.4 ug/m3 LC
4: 01/11/2002 AQS 60010007 1 25.9 ug/m3 LC
5: 01/14/2002 AQS 60010007 1 34.5 ug/m3 LC
6: 01/17/2002 AQS 60010007 1 41.0 ug/m3 LC
DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1: 78 Livermore 1 100
2: 92 Livermore 1 100
3: 71 Livermore 1 100
4: 80 Livermore 1 100
5: 98 Livermore 1 100
6: 115 Livermore 1 100
AQS_PARAMETER_CODE AQS_PARAMETER_DESC CBSA_CODE
1: 88101 PM2.5 - Local Conditions 41860
2: 88101 PM2.5 - Local Conditions 41860
3: 88101 PM2.5 - Local Conditions 41860
4: 88101 PM2.5 - Local Conditions 41860
5: 88101 PM2.5 - Local Conditions 41860
6: 88101 PM2.5 - Local Conditions 41860
CBSA_NAME STATE_CODE STATE COUNTY_CODE COUNTY
1: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
2: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
3: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
4: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
5: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
6: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
SITE_LATITUDE SITE_LONGITUDE
1: 37.68753 -121.7842
2: 37.68753 -121.7842
3: 37.68753 -121.7842
4: 37.68753 -121.7842
5: 37.68753 -121.7842
6: 37.68753 -121.7842
tail(twozero)
Date Source Site ID POC Daily Mean PM2.5 Concentration UNITS
1: 12/10/2002 AQS 61131003 1 15 ug/m3 LC
2: 12/13/2002 AQS 61131003 1 15 ug/m3 LC
3: 12/22/2002 AQS 61131003 1 1 ug/m3 LC
4: 12/25/2002 AQS 61131003 1 23 ug/m3 LC
5: 12/28/2002 AQS 61131003 1 5 ug/m3 LC
6: 12/31/2002 AQS 61131003 1 6 ug/m3 LC
DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1: 57 Woodland-Gibson Road 1 100
2: 57 Woodland-Gibson Road 1 100
3: 4 Woodland-Gibson Road 1 100
4: 74 Woodland-Gibson Road 1 100
5: 21 Woodland-Gibson Road 1 100
6: 25 Woodland-Gibson Road 1 100
AQS_PARAMETER_CODE AQS_PARAMETER_DESC CBSA_CODE
1: 88101 PM2.5 - Local Conditions 40900
2: 88101 PM2.5 - Local Conditions 40900
3: 88101 PM2.5 - Local Conditions 40900
4: 88101 PM2.5 - Local Conditions 40900
5: 88101 PM2.5 - Local Conditions 40900
6: 88101 PM2.5 - Local Conditions 40900
CBSA_NAME STATE_CODE STATE COUNTY_CODE
1: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
2: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
3: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
4: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
5: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
6: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
COUNTY SITE_LATITUDE SITE_LONGITUDE
1: Yolo 38.66121 -121.7327
2: Yolo 38.66121 -121.7327
3: Yolo 38.66121 -121.7327
4: Yolo 38.66121 -121.7327
5: Yolo 38.66121 -121.7327
6: Yolo 38.66121 -121.7327
head(twotwo)
Date Source Site ID POC Daily Mean PM2.5 Concentration UNITS
1: 01/01/2022 AQS 60010007 3 12.7 ug/m3 LC
2: 01/02/2022 AQS 60010007 3 13.9 ug/m3 LC
3: 01/03/2022 AQS 60010007 3 7.1 ug/m3 LC
4: 01/04/2022 AQS 60010007 3 3.7 ug/m3 LC
5: 01/05/2022 AQS 60010007 3 4.2 ug/m3 LC
6: 01/06/2022 AQS 60010007 3 3.8 ug/m3 LC
DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1: 52 Livermore 1 100
2: 55 Livermore 1 100
3: 30 Livermore 1 100
4: 15 Livermore 1 100
5: 18 Livermore 1 100
6: 16 Livermore 1 100
AQS_PARAMETER_CODE AQS_PARAMETER_DESC CBSA_CODE
1: 88101 PM2.5 - Local Conditions 41860
2: 88101 PM2.5 - Local Conditions 41860
3: 88101 PM2.5 - Local Conditions 41860
4: 88101 PM2.5 - Local Conditions 41860
5: 88101 PM2.5 - Local Conditions 41860
6: 88101 PM2.5 - Local Conditions 41860
CBSA_NAME STATE_CODE STATE COUNTY_CODE COUNTY
1: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
2: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
3: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
4: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
5: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
6: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
SITE_LATITUDE SITE_LONGITUDE
1: 37.68753 -121.7842
2: 37.68753 -121.7842
3: 37.68753 -121.7842
4: 37.68753 -121.7842
5: 37.68753 -121.7842
6: 37.68753 -121.7842
tail(twotwo)
Date Source Site ID POC Daily Mean PM2.5 Concentration UNITS
1: 12/01/2022 AQS 61131003 1 3.4 ug/m3 LC
2: 12/07/2022 AQS 61131003 1 3.8 ug/m3 LC
3: 12/13/2022 AQS 61131003 1 6.0 ug/m3 LC
4: 12/19/2022 AQS 61131003 1 34.8 ug/m3 LC
5: 12/25/2022 AQS 61131003 1 23.2 ug/m3 LC
6: 12/31/2022 AQS 61131003 1 1.0 ug/m3 LC
DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1: 14 Woodland-Gibson Road 1 100
2: 16 Woodland-Gibson Road 1 100
3: 25 Woodland-Gibson Road 1 100
4: 99 Woodland-Gibson Road 1 100
5: 74 Woodland-Gibson Road 1 100
6: 4 Woodland-Gibson Road 1 100
AQS_PARAMETER_CODE AQS_PARAMETER_DESC CBSA_CODE
1: 88101 PM2.5 - Local Conditions 40900
2: 88101 PM2.5 - Local Conditions 40900
3: 88101 PM2.5 - Local Conditions 40900
4: 88101 PM2.5 - Local Conditions 40900
5: 88101 PM2.5 - Local Conditions 40900
6: 88101 PM2.5 - Local Conditions 40900
CBSA_NAME STATE_CODE STATE COUNTY_CODE
1: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
2: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
3: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
4: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
5: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
6: Sacramento--Roseville--Arden-Arcade, CA 6 California 113
COUNTY SITE_LATITUDE SITE_LONGITUDE
1: Yolo 38.66121 -121.7327
2: Yolo 38.66121 -121.7327
3: Yolo 38.66121 -121.7327
4: Yolo 38.66121 -121.7327
5: Yolo 38.66121 -121.7327
6: Yolo 38.66121 -121.7327
Step 1c. Check the key variable, Daily Mean PM2.5 Concentration
table(is.na(twozero$`Daily Mean PM2.5 Concentration`))
FALSE
15976
table(is.na(twotwo$`Daily Mean PM2.5 Concentration`))
FALSE
56140
Step 1d. Check the summaries
summary(twozero$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
summary(twotwo$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.20 4.20 6.90 8.52 10.80 302.50
table(is.na(twotwo))
FALSE TRUE
1118601 4199
table(is.na(twozero))
FALSE TRUE
318591 929
Summary findings: After importing the 2002 and 2022 data sets, there are 929 NA values in the 2002 data and 4,199 NA values in the 2022 data across all columns, but none in the key variable, Daily Mean PM2.5 Concentration. In 2002 the concentration ranged from a minimum of 0.00 to a maximum of 104.30, with a median of 12.00 and a mean of 16.12. In 2022 the minimum was -2.20 and the maximum was 302.50, a much wider range than in 2002. However, the 2022 median was 6.90 and the mean was 8.52, both lower than in 2002.
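The whole-table NA counts above do not say which columns hold the missing values. A minimal sketch of a per-column breakdown, using a small made-up data frame (`toy` and its values are assumptions for illustration, not the EPA data):

```r
# Toy stand-in for one year of data; values are invented for illustration.
toy <- data.frame(
  pm25      = c(12.7, -2.2, 7.1, 302.5),
  cbsa_code = c(41860, NA, 41860, NA),
  county    = c("Alameda", "Alameda", "Yolo", "Yolo")
)

# Number of missing values in each column
na_by_col <- colSums(is.na(toy))
print(na_by_col)

# Names of the columns that contain at least one NA
cols_with_na <- names(na_by_col)[na_by_col > 0]
print(cols_with_na)
```

Running the same two calls on twozero and twotwo would show where the 929 and 4,199 NA values sit.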
Step 2. Combine data
Step 2a. New year column identifier
twozero$year <-2002
twotwo$year <-2022
Step 2b. Renaming key variables for 2002 and 2022 data sets
twozero$PM2.5 <- twozero$`Daily Mean PM2.5 Concentration`
twozero$`Daily Mean PM2.5 Concentration` <- NULL
twozero$lat <- twozero$SITE_LATITUDE
twozero$SITE_LATITUDE <- NULL
twozero$lon <- twozero$SITE_LONGITUDE
twozero$SITE_LONGITUDE <- NULL
twotwo$PM2.5 <- twotwo$`Daily Mean PM2.5 Concentration`
twotwo$`Daily Mean PM2.5 Concentration` <- NULL
twotwo$lat <- twotwo$SITE_LATITUDE
twotwo$SITE_LATITUDE <- NULL
twotwo$lon <- twotwo$SITE_LONGITUDE
twotwo$SITE_LONGITUDE <- NULL
Date Source Site ID POC UNITS DAILY_AQI_VALUE
1: 01/05/2002 AQS 60010007 1 ug/m3 LC 78
2: 01/06/2002 AQS 60010007 1 ug/m3 LC 92
3: 01/08/2002 AQS 60010007 1 ug/m3 LC 71
4: 01/11/2002 AQS 60010007 1 ug/m3 LC 80
5: 01/14/2002 AQS 60010007 1 ug/m3 LC 98
---
72112: 12/07/2022 AQS 61131003 1 ug/m3 LC 16
72113: 12/13/2022 AQS 61131003 1 ug/m3 LC 25
72114: 12/19/2022 AQS 61131003 1 ug/m3 LC 99
72115: 12/25/2022 AQS 61131003 1 ug/m3 LC 74
72116: 12/31/2022 AQS 61131003 1 ug/m3 LC 4
Site Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
1: Livermore 1 100 88101
2: Livermore 1 100 88101
3: Livermore 1 100 88101
4: Livermore 1 100 88101
5: Livermore 1 100 88101
---
72112: Woodland-Gibson Road 1 100 88101
72113: Woodland-Gibson Road 1 100 88101
72114: Woodland-Gibson Road 1 100 88101
72115: Woodland-Gibson Road 1 100 88101
72116: Woodland-Gibson Road 1 100 88101
AQS_PARAMETER_DESC CBSA_CODE
1: PM2.5 - Local Conditions 41860
2: PM2.5 - Local Conditions 41860
3: PM2.5 - Local Conditions 41860
4: PM2.5 - Local Conditions 41860
5: PM2.5 - Local Conditions 41860
---
72112: PM2.5 - Local Conditions 40900
72113: PM2.5 - Local Conditions 40900
72114: PM2.5 - Local Conditions 40900
72115: PM2.5 - Local Conditions 40900
72116: PM2.5 - Local Conditions 40900
CBSA_NAME STATE_CODE STATE
1: San Francisco-Oakland-Hayward, CA 6 California
2: San Francisco-Oakland-Hayward, CA 6 California
3: San Francisco-Oakland-Hayward, CA 6 California
4: San Francisco-Oakland-Hayward, CA 6 California
5: San Francisco-Oakland-Hayward, CA 6 California
---
72112: Sacramento--Roseville--Arden-Arcade, CA 6 California
72113: Sacramento--Roseville--Arden-Arcade, CA 6 California
72114: Sacramento--Roseville--Arden-Arcade, CA 6 California
72115: Sacramento--Roseville--Arden-Arcade, CA 6 California
72116: Sacramento--Roseville--Arden-Arcade, CA 6 California
COUNTY_CODE COUNTY year PM2.5 lat lon
1: 1 Alameda 2002 25.1 37.68753 -121.7842
2: 1 Alameda 2002 31.6 37.68753 -121.7842
3: 1 Alameda 2002 21.4 37.68753 -121.7842
4: 1 Alameda 2002 25.9 37.68753 -121.7842
5: 1 Alameda 2002 34.5 37.68753 -121.7842
---
72112: 113 Yolo 2022 3.8 38.66121 -121.7327
72113: 113 Yolo 2022 6.0 38.66121 -121.7327
72114: 113 Yolo 2022 34.8 38.66121 -121.7327
72115: 113 Yolo 2022 23.2 38.66121 -121.7327
72116: 113 Yolo 2022 1.0 38.66121 -121.7327
dim(combined_data)
[1] 72116 21
The 72,116 observations are the sum of the 15,976 observations from the 2002 data and the 56,140 observations from the 2022 data. The 21 columns are the original 20 plus the new year column.
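The call that builds combined_data is not shown above. One common way to stack two tables with identical columns is `rbind()`, which is assumed in this minimal sketch; the `toy_*` objects are made-up stand-ins, not the EPA data:

```r
# Two small tables with the same columns, one per year (invented rows).
toy_2002 <- data.frame(Date = c("01/05/2002", "01/06/2002"),
                       PM2.5 = c(25.1, 31.6))
toy_2022 <- data.frame(Date = c("01/01/2022", "01/02/2022", "01/03/2022"),
                       PM2.5 = c(12.7, 13.9, 7.1))

# Tag each table with its year before stacking
toy_2002$year <- 2002
toy_2022$year <- 2022

# Stack the rows; column names must match in both tables
toy_combined <- rbind(toy_2002, toy_2022)

# The combined row count should equal the sum of the two inputs
stopifnot(nrow(toy_combined) == nrow(toy_2002) + nrow(toy_2022))
print(table(toy_combined$year))
```

The row-count check mirrors the reasoning in the text: 15,976 + 56,140 = 72,116.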
Compared to the 2002 site locations, the 2022 sites appear to have expanded along the coast and into central California. In 2002, the sites were concentrated around the major cities and along the eastern border.
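A visual comparison of site locations can be backed up by counting the distinct monitoring sites in each year. A minimal sketch on made-up rows; the data frame `toy`, the site ID 69999999, and its coordinates are hypothetical:

```r
# Invented daily rows: two sites in 2002, three in 2022.
toy <- data.frame(
  year    = c(2002, 2002, 2002, 2022, 2022, 2022, 2022),
  site_id = c(60010007, 60010007, 61131003,
              60010007, 61131003, 61131003, 69999999),
  lat     = c(37.68753, 37.68753, 38.66121,
              37.68753, 38.66121, 38.66121, 35.00000),
  lon     = c(-121.7842, -121.7842, -121.7327,
              -121.7842, -121.7327, -121.7327, -120.0000)
)

# Keep one row per site and year
sites <- unique(toy[, c("year", "site_id", "lat", "lon")])

# Number of distinct sites in each year
sites_per_year <- table(sites$year)
print(sites_per_year)

# Sites present in 2022 that were not present in 2002
new_sites <- setdiff(sites$site_id[sites$year == 2022],
                     sites$site_id[sites$year == 2002])
print(new_sites)
```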
Step 4. Check for missing data
table(is.na(combined_data$PM2.5))
FALSE
72116
The check above shows that there are no missing values for PM2.5. Potentially implausible values include the maximum of 302.50 in the 2022 data and the negative minimum of -2.20 in the same year, since a concentration cannot physically be below zero.
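Implausible values can be counted directly with logical tests. A minimal sketch on a made-up vector; `upper_cutoff` is an arbitrary, hypothetical threshold chosen only for illustration, not a regulatory limit:

```r
# Invented PM2.5 readings, including one negative and one very high value.
pm25 <- c(12.7, -2.2, 7.1, 302.5, 0, 34.8)

# Negative concentrations are physically impossible
is_negative <- pm25 < 0

# Hypothetical cutoff for "suspiciously high" readings (assumption)
upper_cutoff <- 250
is_very_high <- pm25 > upper_cutoff

# How many readings fall into each group
n_negative  <- sum(is_negative)
n_very_high <- sum(is_very_high)
print(c(negative = n_negative, very_high = n_very_high))

# Share of readings that are negative
print(mean(is_negative))
```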
Step 5. Three Different Spatial Levels
2002 and 2022 Data at State Level
hist(twozero$PM2.5, breaks=100)
hist(twotwo$PM2.5, breaks=100)
summary(twozero$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
summary(twotwo$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.20 4.20 6.90 8.52 10.80 302.50
twozero_summaryPM2.5 <- summary(twozero$PM2.5)
twotwo_summaryPM2.5 <- summary(twotwo$PM2.5)
median_val0 <- twozero_summaryPM2.5["Median"]
q10_val <- twozero_summaryPM2.5["1st Qu."]
q30_val <- twozero_summaryPM2.5["3rd Qu."]
median_val <- twotwo_summaryPM2.5["Median"]
q1_val <- twotwo_summaryPM2.5["1st Qu."]
q3_val <- twotwo_summaryPM2.5["3rd Qu."]
# Create a boxplot using the extracted summary statistics
boxplot(
  median_val0, # Median
  q10_val,     # 1st Quartile
  q30_val,     # 3rd Quartile
  main = "Summary Boxplot of 2002",
  names = c("Median", "1st Quartile", "3rd Quartile"),
  ylab = "Values"
)
At the state level, both the median and the mean were lower in 2022 than in 2002 (median 6.90 vs 12.00; mean 8.52 vs 16.12). However, 2022 also had extreme values well above anything seen in 2002: the 2022 maximum was 302.50, compared with 104.30 in 2002.
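Since the two years are being compared, one option is to draw both distributions in a single side-by-side boxplot and compute the medians by year. A minimal sketch on made-up values; the data frame `toy` is an assumption, and with the real data the same calls would use a combined table with year and PM2.5 columns:

```r
# Invented readings, three per year.
toy <- data.frame(
  year = c(2002, 2002, 2002, 2022, 2022, 2022),
  pm25 = c(7.0, 12.0, 20.5, 4.2, 6.9, 10.8)
)

# Median PM2.5 for each year
medians <- tapply(toy$pm25, toy$year, median)
print(medians)

# One box per year on a shared axis
boxplot(pm25 ~ year, data = toy,
        main = "PM2.5 by year (toy data)",
        xlab = "Year", ylab = "PM2.5 (ug/m3)")
```

Plotting the raw values by year keeps the outliers visible, which a boxplot of summary statistics alone does not.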
2002 and 2022 Data at County Level - Los Angeles County Code 37
# Data for LA County 2002
LACOUNTY <- subset(twozero, COUNTY == 'Los Angeles')
# Plotting a histogram
hist(LACOUNTY$PM2.5, main = "Histogram of PM2.5 in Los Angeles", xlab = "PM2.5")
# Boxplot with a descriptive axis label in place of the placeholder text
boxplot(LACOUNTY$PM2.5,
        main = "Boxplot LA County 2002",
        ylab = "Daily Mean PM2.5 (ug/m3)",
        col = "darkgreen")
summary(LACOUNTY$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.60 11.10 17.40 19.66 25.50 72.40
# Dates are stored as month/day/year text, so pass the format explicitly
LACOUNTY$DATENUM <- as.Date(LACOUNTY$Date, format = "%m/%d/%Y")
plot(LACOUNTY$DATENUM, LACOUNTY$PM2.5, type = 'l')
# Data for LA County 2022
LACOUNTY2022 <- subset(twotwo, COUNTY == 'Los Angeles')
# Plotting a histogram
hist(LACOUNTY2022$PM2.5, main = "Histogram of PM2.5 in Los Angeles in 2022", xlab = "PM2.5")
# Boxplot with a descriptive axis label in place of the placeholder text
boxplot(LACOUNTY2022$PM2.5,
        main = "Boxplot LA County 2022",
        ylab = "Daily Mean PM2.5 (ug/m3)",
        col = "aquamarine")
summary(LACOUNTY2022$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.20 7.40 10.30 10.97 13.70 56.00
# Dates are stored as month/day/year text, so pass the format explicitly
LACOUNTY2022$DATENUM <- as.Date(LACOUNTY2022$Date, format = "%m/%d/%Y")
plot(LACOUNTY2022$DATENUM, LACOUNTY2022$PM2.5, type = 'l')
There appears to be more data recorded for LA County in 2022 than in 2002. Comparing the means and medians, daily PM2.5 concentrations in LA County were lower in 2022 than in 2002 (median 10.30 vs 17.40; mean 10.97 vs 19.66). This is consistent with the state-level decrease. Of note, in 2022 there appears to be a spike in PM2.5 during the summer months.
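Seasonal patterns such as a summer spike depend on the Date column being parsed correctly. The dates in these files are written as month/day/year, while `as.Date()` without a `format` argument tries year-first layouts, so the format should be given explicitly. A minimal sketch on made-up values (`date_chr` and `pm25` are invented) that parses the dates and averages by month:

```r
# Invented dates in the same month/day/year layout as the EPA files.
date_chr <- c("01/05/2022", "07/15/2022", "07/16/2022", "12/31/2022")
pm25     <- c(4.2, 20.0, 30.0, 1.0)

# Parse with an explicit format so month and day are not confused
parsed <- as.Date(date_chr, format = "%m/%d/%Y")
print(parsed)

# Month number (1 to 12) for each reading
month_num <- as.integer(format(parsed, "%m"))

# Mean PM2.5 per month
monthly_mean <- tapply(pm25, month_num, mean)
print(monthly_mean)
```

A table of monthly means like this gives a numeric check on what the line plot suggests.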
Site Name - Los Angeles-North Main Street
# Data for LA SITE 2002
twozero$SITENAME <- twozero$`Site Name`
# Subset data
LASITE0 <- twozero[twozero$SITENAME == 'Los Angeles-North Main Street',]
# Plotting a histogram
hist(LASITE0$PM2.5, main = "Histogram of PM2.5 in Los Angeles North Main Street", xlab = "PM2.5")
# Boxplot with a descriptive axis label in place of the placeholder text
boxplot(LASITE0$PM2.5,
        main = "Boxplot LA SITE 2002",
        ylab = "Daily Mean PM2.5 (ug/m3)",
        col = "mediumaquamarine")
summary(LASITE0$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.90 13.90 19.30 21.97 26.90 66.30
# Dates are stored as month/day/year text, so pass the format explicitly
LASITE0$DATENUM <- as.Date(LASITE0$Date, format = "%m/%d/%Y")
plot(LASITE0$DATENUM, LASITE0$PM2.5, type = 'l')
# Data for LA SITE 2022
twotwo$SITENAME <- twotwo$`Site Name`
# Subset data
LASITE22 <- twotwo[twotwo$SITENAME == 'Los Angeles-North Main Street',]
# Plotting a histogram
hist(LASITE22$PM2.5, main = "Histogram of PM2.5 in Los Angeles North Main Street 2022", xlab = "PM2.5")
# Boxplot with a descriptive axis label in place of the placeholder text
boxplot(LASITE22$PM2.5,
        main = "Boxplot LA SITE 2022",
        ylab = "Daily Mean PM2.5 (ug/m3)",
        col = "mediumpurple")
summary(LASITE22$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.400 8.325 10.900 11.583 14.300 38.000
# Dates are stored as month/day/year text, so pass the format explicitly
LASITE22$DATENUM <- as.Date(LASITE22$Date, format = "%m/%d/%Y")
plot(LASITE22$DATENUM, LASITE22$PM2.5, type = 'l')
Consistent with the previous patterns, this LA site showed lower daily PM2.5 concentrations in 2022 than in 2002: the median was 10.90 in 2022 versus 19.30 in 2002. The histograms also show that the 2022 values were less spread out, with a maximum of 38.00 compared with 66.30 in 2002.
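The decrease can be put on a common scale by computing the percent change in the median at each spatial level. A minimal sketch using only the medians reported in the summaries above; the data frame name `medians` and its column names are chosen here for illustration:

```r
# Medians taken from the summary() output at each spatial level.
medians <- data.frame(
  level       = c("State", "LA County", "LA North Main Street"),
  median_2002 = c(12.00, 17.40, 19.30),
  median_2022 = c(6.90, 10.30, 10.90)
)

# Percent change from 2002 to 2022, rounded to one decimal place
medians$pct_change <- round(
  100 * (medians$median_2022 - medians$median_2002) / medians$median_2002,
  1
)
print(medians)
```

At all three levels the median fell by roughly 41 to 44 percent between 2002 and 2022.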